Run-Time Selection of Block Size in Pipelined Parallel Programs

نویسندگان

David K. Lowenthal

Michael James

چکیده

Parallelizing compiler technology has improved in recent years. One area in which compilers have made progress is in handling DOACROSS loops, where crossprocessor data dependencies can inhibit e cient parallelization. In regular DOACROSS loops, where dependencies can be determined at compile time, a useful parallelization technique is pipelining, where each processor (node) performs its computation in blocks; after each, it sends data to the next processor in the pipeline. The amount of computation before sending a message is called the block size; its choice, although di cult for a compiler to make, is critical to the e ciency of the program. Compilers typically use a static estimation of workload, which cannot always produce an e ective block size. This paper describes a exible run-time approach to choosing the block size. Our system takes measurements during the rst iteration of the program and then uses the results to build an execution model and choose an appropriate block size which, unlike those chosen by compiler analysis, may be nonuniform. Performance on a network of workstations shows that programs using our run-time analysis outperform those that use static block sizes when the workload is either unbalanced or unanalyzable. On more regular programs, our programs are competitive with their static counterparts.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Run - Time Selection of Block Size in Pipelined

Parallelizing compiler technology has improved in recent years. One area in which compilers have made progress is in handling DOACROSS loops, where cross-processor data dependencies can inhibit eecient parallelization. In regular DOACROSS loops, where dependencies can be determined at compile time, a useful parallelization technique is pipelining, where each processor (node) performs its comput...

متن کامل

Efficient implementation of low time complexity and pipelined bit-parallel polynomial basis multiplier over binary finite fields

This paper presents two efficient implementations of fast and pipelined bit-parallel polynomial basis multipliers over GF (2m) by irreducible pentanomials and trinomials. The architecture of the first multiplier is based on a parallel and independent computation of powers of the polynomial variable. In the second structure only even powers of the polynomial variable are used. The par...

متن کامل

Second - level Instruction Cache Thread Processing Unit Thread Processing Unit Thread Processing Unit Instruction Cache First - level First - level First - level Instruction Cache Instruction Cache Execution

This paper presents a new parallelization model, called coarse-grained thread pipelining, for exploiting speculative coarse-grained parallelism from general-purpose application programs in shared-memory multiprocessor systems. This parallelization model, which is based on the ne-grained thread pipelining model proposed for the superthreaded architecture 11, 12], allows concurrent execution of l...

متن کامل

A Rotate-Tiling Image Composition Method for Parallel Volume Rendering on Distributed Memory Multicomputers

The binary-swap and the parallel-pipelined methods are two popular image composition methods for volume rendering on distributed memory multicomputers. However, these methods either restrict the number of processors to a power of two or require many steps to transform image data that results in high communication overheads. In this paper, we present an efficient image composition method, the ro...

متن کامل

Parallel computing using MPI and OpenMP on self-configured platform, UMZHPC.

Parallel computing is a topic of interest for a broad scientific community since it facilitates many time-consuming algorithms in different application domains.In this paper, we introduce a novel platform for parallel computing by using MPI and OpenMP programming languages based on set of networked PCs. UMZHPC is a free Linux-based parallel computing infrastructure that has been developed to cr...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 1999

Run-Time Selection of Block Size in Pipelined Parallel Programs

نویسندگان

چکیده

منابع مشابه

Run - Time Selection of Block Size in Pipelined

Efficient implementation of low time complexity and pipelined bit-parallel polynomial basis multiplier over binary finite fields

Second - level Instruction Cache Thread Processing Unit Thread Processing Unit Thread Processing Unit Instruction Cache First - level First - level First - level Instruction Cache Instruction Cache Execution

A Rotate-Tiling Image Composition Method for Parallel Volume Rendering on Distributed Memory Multicomputers

Parallel computing using MPI and OpenMP on self-configured platform, UMZHPC.

عنوان ژورنال:

اشتراک گذاری